Bronze Autoloader Generic

Document Version: 1.0
Last Updated: 20-04-2026

`02_bronze_autoloader_generic`

Purpose

02_bronze_autoloader_generic is the shared ingestion notebook that loads source files into a bronze Delta table using Databricks Auto Loader.

It is designed to be reusable across many feeds by changing widget parameters instead of cloning notebook logic.

What the active implementation does

The uploaded version of this notebook performs the following active steps:

Reads widget parameters.
Validates required core values.
Resolves the schema JSON file path relative to the current notebook.
Loads the schema from JSON into a Spark StructType.
Configures a cloudFiles streaming reader.
Applies source-format options such as CSV delimiter and header handling.
Adds standard ingestion metadata columns.
Writes the stream to the target Delta table using availableNow=True.
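
The first two steps can be sketched as follows. This is a minimal illustration, assuming the widget values have already been collected into a plain dict (in a notebook they would typically come from `dbutils.widgets.get`). The list of required names, including `source_path`, is an assumption for illustration and may not match the notebook's actual checks.

```python
from typing import Mapping, Sequence

# Assumption: these names are treated as required; the real notebook may differ.
REQUIRED_PARAMS = (
    "source_path",        # hypothetical name for the landing location
    "source_format",
    "target_table_name",
    "checkpoint_path",
    "schema_file_path",
)


def validate_required(params: Mapping[str, str],
                      required: Sequence[str] = REQUIRED_PARAMS) -> None:
    """Raise ValueError naming every required parameter that is missing or blank."""
    missing = [name for name in required
               if not str(params.get(name) or "").strip()]
    if missing:
        raise ValueError(f"Missing required parameters: {', '.join(missing)}")
```

Collecting all missing names before raising means a misconfigured job reports every problem in one run rather than one per attempt.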

Read pattern

The notebook uses:

spark.readStream
format cloudFiles
cloudFiles.format = source_format
cloudFiles.schemaLocation = {checkpoint_path}/_schemas
cloudFiles.rescuedDataColumn = rescued_data_column

This is Databricks Auto Loader, which discovers new files incrementally in cloud storage and Unity Catalog volumes, so each run picks up only files it has not already processed.
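
The reader options listed above can be assembled as a plain dict before being passed to the stream reader. This is a sketch rather than the notebook's own code: the function name is hypothetical, the option keys mirror the list above, and the `_rescued_data` default is assumed.

```python
from typing import Dict


def build_autoloader_options(source_format: str,
                             checkpoint_path: str,
                             rescued_data_column: str = "_rescued_data") -> Dict[str, str]:
    """Return the cloudFiles reader options described above as a plain dict."""
    return {
        "cloudFiles.format": source_format,
        # Schema tracking lives under the checkpoint directory.
        "cloudFiles.schemaLocation": f"{checkpoint_path}/_schemas",
        "cloudFiles.rescuedDataColumn": rescued_data_column,
    }


# In a Databricks notebook the dict would be applied roughly like this
# (spark, schema and source_path come from the notebook context):
#
#   df = (spark.readStream
#         .format("cloudFiles")
#         .options(**build_autoloader_options("csv", checkpoint_path))
#         .schema(schema)
#         .load(source_path))
```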

Write pattern

The notebook writes with:

.format("delta")
.outputMode(output_mode)
.option("checkpointLocation", checkpoint_path)
.option("mergeSchema", str(merge_schema).lower())
.trigger(availableNow=True)
.toTable(target_table_name)

availableNow=True means each job run behaves like a bounded ingestion run that processes all currently available files and then stops.
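
Wrapped in a function, the write chain might look like the sketch below. The function name and the `append` default for `output_mode` are assumptions; `df` stands for the streaming DataFrame produced by the Auto Loader reader.

```python
def write_bronze(df, target_table_name: str, checkpoint_path: str,
                 output_mode: str = "append", merge_schema: bool = True):
    """Start a bounded write to the bronze Delta table and return the StreamingQuery."""
    return (
        df.writeStream
        .format("delta")
        .outputMode(output_mode)
        .option("checkpointLocation", checkpoint_path)
        # Writer options are passed as strings, so True becomes "true".
        .option("mergeSchema", str(merge_schema).lower())
        # Process everything currently available, then stop.
        .trigger(availableNow=True)
        .toTable(target_table_name)
    )
```

Calling `awaitTermination()` on the returned query blocks until the run has processed the available files, which is usually what a scheduled job task wants before moving on to the next step.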

Metadata columns added by the notebook

The notebook enriches ingested rows with standard bronze metadata:

w_business_ts
w_target_table_name
w_load_type
w_run_date
w_ingest_ts
w_source_file_name
w_ingestion_run_id
w_source_system
w_job_name
w_task_name
w_job_id
w_job_run_id
w_task_run_id
w_job_trigger_type
w_job_start_ts

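These fields let every bronze row be traced back to its source file, its ingestion run, and the job and task run that loaded it, which simplifies downstream lineage and support.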
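
A few of these columns could be attached as in the sketch below. It covers only a subset of the list. The function names, the parameter names, the string form of `w_run_date`, and the use of `_metadata.file_name` for the source file name are all assumptions, since the notebook's actual expressions are not shown here.

```python
from datetime import date
from typing import Dict


def literal_metadata(target_table_name: str, load_type: str,
                     source_system: str, ingestion_run_id: str,
                     run_date: date) -> Dict[str, str]:
    """Values that are constant for the whole run, keyed by column name."""
    return {
        "w_target_table_name": target_table_name,
        "w_load_type": load_type,
        "w_source_system": source_system,
        "w_ingestion_run_id": ingestion_run_id,
        "w_run_date": run_date.isoformat(),  # assumption: stored as ISO date text
    }


def add_metadata(df, literals: Dict[str, str]):
    """Append the literal columns plus a per-row ingest timestamp and file name."""
    # Imported lazily so literal_metadata can be used without a Spark install.
    from pyspark.sql import functions as F

    for name, value in literals.items():
        df = df.withColumn(name, F.lit(value))
    return (
        df.withColumn("w_ingest_ts", F.current_timestamp())
        # Assumption: the file name is read from the hidden _metadata column.
        .withColumn("w_source_file_name", F.col("_metadata.file_name"))
    )
```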

Schema handling

The notebook requires schema_file_path and reads the schema file as JSON.
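
For orientation, a schema file in the JSON layout that Spark's `StructType.fromJson` accepts could be parsed as sketched below. The field names in the example are invented, and the pre-check function is a hypothetical helper rather than part of the notebook.

```python
import json

# Invented example of a schema file's contents.
EXAMPLE_SCHEMA_JSON = """
{"type": "struct", "fields": [
  {"name": "order_id", "type": "integer", "nullable": false, "metadata": {}},
  {"name": "order_ts", "type": "timestamp", "nullable": true, "metadata": {}}
]}
"""


def parse_schema_json(text: str) -> dict:
    """Parse schema JSON and check it has the basic struct/fields shape."""
    schema = json.loads(text)
    if (not isinstance(schema, dict)
            or schema.get("type") != "struct"
            or not isinstance(schema.get("fields"), list)):
        raise ValueError("Schema JSON must be a struct with a 'fields' list")
    for field in schema["fields"]:
        if "name" not in field or "type" not in field:
            raise ValueError(f"Schema field is missing name or type: {field}")
    return schema


def to_struct_type(schema: dict):
    """Convert the parsed dict into a Spark StructType."""
    # Imported lazily so the parsing check above runs without a Spark install.
    from pyspark.sql.types import StructType

    return StructType.fromJson(schema)
```

Checking the shape before handing the dict to Spark gives a clearer error message when a schema file is truncated or saved in the wrong format.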

Path resolution behavior

schema_file_path can be provided in one of these forms:

/Workspace/...
/some/workspace/relative/path
./Schemas/schema_x.json

If the path is relative, the notebook resolves it relative to the notebook directory in the workspace.

This is useful when keeping notebook and schema assets together in the same folder structure.
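
The three accepted forms could be resolved as in this sketch. The function name is hypothetical, and prefixing `/Workspace` onto other absolute paths is an assumption about how the second form is handled; only the relative-path rule is stated above.

```python
import posixpath


def resolve_schema_path(schema_file_path: str, notebook_dir: str) -> str:
    """Turn a schema_file_path in any of the three forms into a /Workspace path."""
    path = schema_file_path.strip()
    if path.startswith("/Workspace/"):
        # Already a full workspace filesystem path.
        return posixpath.normpath(path)
    if path.startswith("/"):
        # Assumption: other absolute paths are workspace paths missing the prefix.
        return posixpath.normpath("/Workspace" + path)
    # Relative paths are resolved against the notebook's own directory.
    return posixpath.normpath(posixpath.join(notebook_dir, path))
```

For example, with a notebook directory of `/Workspace/Repos/team/project/notebooks`, the value `./Schemas/schema_x.json` resolves to `/Workspace/Repos/team/project/notebooks/Schemas/schema_x.json`.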

CSV-specific behavior

When source_format is csv, the notebook applies:

sep = delimiter
header = header
nullValue = null_value

For other source formats, those options are ignored unless you extend the notebook.
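
The conditional behavior can be expressed as a small helper that returns reader options only for CSV. The function name and the default values are assumptions; the option keys are the ones listed above.

```python
from typing import Dict


def format_specific_options(source_format: str,
                            delimiter: str = ",",
                            header: bool = True,
                            null_value: str = "") -> Dict[str, str]:
    """Return CSV reader options for csv sources and an empty dict otherwise."""
    if source_format.lower() != "csv":
        # Other formats receive no extra options in this sketch.
        return {}
    return {
        "sep": delimiter,
        "header": str(header).lower(),  # reader options are passed as strings
        "nullValue": null_value,
    }
```

Supporting another format would mean adding a branch here rather than changing the reader code that consumes the dict.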

Checkpointing and schema tracking

Two storage locations matter:

`checkpoint_path`

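Used for Structured Streaming checkpoint state, including the record of which files have already been ingested. This path must stay stable for the job. Changing it after go-live makes the stream start without that history, so files that were already loaded can be ingested again.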

`schema_location`

Derived automatically as:

{checkpoint_path}/_schemas

This is where Auto Loader stores schema tracking information.

Current limitations and reserved parameters

The notebook includes parameters such as:

staging_table_name
business_keys
overwrite_schema
cleanup_stage_after_finalize

The uploaded implementation currently does not actively use those parameters in the live execution path.

There are commented sections that suggest an extended design involving:

staging table writes
row counting from a staged run
high-watermark processing
finalize logic for snapshot / incremental handling
cleanup of staged rows

Because those blocks are commented out, the generic notebook does not perform any of those steps unless your environment runs a modified version.

`load_type` in the current implementation

load_type is currently written as metadata (w_load_type) and forwarded to the audit notebook. In the uploaded active path, it does not yet change the write behavior by itself.

That means values like snapshot and incremental are still useful for lineage and future compatibility, but they do not independently change ingestion semantics in this version unless you extend the notebook.

Operational guidance

Keep one checkpoint path per source/table

Do not share the same checkpoint path across unrelated feeds. Each logical ingestion should have its own checkpoint directory.
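
A simple pre-deployment check can flag feeds that point at the same checkpoint directory. This sketch assumes the job configuration is available as a mapping of feed name to `checkpoint_path`; the function and feed names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def find_shared_checkpoints(feeds: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each checkpoint path used by more than one feed to those feed names."""
    by_path = defaultdict(list)
    for feed_name, checkpoint_path in feeds.items():
        # Ignore trailing slashes so /chk/orders and /chk/orders/ count as the same path.
        by_path[checkpoint_path.rstrip("/")].append(feed_name)
    return {path: sorted(names)
            for path, names in by_path.items() if len(names) > 1}
```

An empty result means every feed has its own checkpoint directory.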

Keep schema files under source control

The schema JSON is part of the ingestion contract. Store it with the notebook project and update it through normal change control.

Use precise file patterns

source_file_pattern helps prevent accidentally ingesting unrelated files from the same landing folder.
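
Before deploying a pattern, it can be tried against a sample of file names. The sketch below uses Python's standard `fnmatch` module as a local approximation; the glob matching applied by the ingestion itself may differ in details, and the file names are invented.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def matching_files(file_names: Iterable[str], source_file_pattern: str) -> List[str]:
    """Return the file names that match the pattern, case-sensitively."""
    return [name for name in file_names
            if fnmatchcase(name, source_file_pattern)]


# Invented sample of a shared landing folder.
landing = [
    "orders_20260420.csv",
    "orders_20260421.csv",
    "customers_20260420.csv",
    "orders_20260420.csv.tmp",
]

# Only the two finished orders files match; the customers file and the
# temporary file are excluded.
selected = matching_files(landing, "orders_*.csv")
```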

Use a rescued data column

Keep _rescued_data enabled unless you have a strong reason not to. It preserves values that do not fit the supplied schema, such as unexpected columns or type mismatches, so they can be investigated instead of being silently dropped.

Common issues

No files loaded

Possible causes:

landing path is empty
file pattern does not match incoming files
checkpoint already recorded those files
source path is wrong

Schema file not found

Possible causes:

relative path resolved incorrectly
schema file not deployed with the notebooks
incorrect workspace path syntax

Unexpected schema errors

Possible causes:

schema JSON does not match actual file layout
delimiter or header configuration is wrong
merge_schema is set differently from what the feed expects (the active path only passes it through as the mergeSchema write option)
